Yule-generated trees constrained by node imbalance 

Filippo Disanto, Anna Schlizio, Thomas Wiehe 

Institut für Genetik, Universität zu Köln, Zülpicher Straße 47a, 50674 Köln, Germany



Abstract

The Yule process generates a class of binary trees which are fundamental to many population genetic models and other applications in biology. In this paper, we introduce a family of sub-classes of ranked trees, called Ω-trees, which are defined via an imbalance parameter $\omega$. For an internal node $i$, let $\omega_i$ be the size of the smallest left/right subtree originating at $i$. Given $\omega$, the relation $\omega_i \le \omega$ holds for all nodes $i$ of an Ω-tree. When $\omega$ is small, trees have a simpler structure: the space of Ω-trees is drastically reduced with respect to un-restricted Yule trees, but several statistical properties are maintained.

Here, we provide a generating function approach to study, under the $\omega$-constraint, some basic combinatorial statistics, for instance the number of subtrees of fixed size. As a case study we focus on the distribution of the number of cherries, i.e. the number of subtrees with two leaves. We show that expectation and variance of this distribution match those for un-restricted trees already for very small values of $\omega$.
1. Introduction

Given a direction by time, ancestry relationships between species, individuals, alleles or cells can be depicted as a rooted tree. Of particular interest are binary rooted un-ordered trees. These can be further classified into several subclasses. We will consider ranked trees, which are defined below. Our focus here is primarily on discrete properties, i.e. tree topology, and less on continuous properties, i.e. branch lengths.

We assume that trees are generated by the Yule process or, equivalently, by [...] a bifurcation. Starting with a root, each pendant edge has, at any time, an equal probability of splitting into two leaves. See Harding [4] and Steel and McKenzie [10] for further details.

An important parameter, which has been investigated in several studies, is the number of subtrees of given size (see McKenzie and Steel [6], Blum and François [1], Rosenberg [8], Disanto and Wiehe [2]). The first results in this series concerned subtrees with 2 leaves, called cherries (see McKenzie and Steel [6]).

A different, but also purely topological, tree parameter is imbalance. Several tree imbalance measures are in use, among them Colless's index and Sackin's index [7], [9]. These measures are summary statistics of the degree of imbalance averaged across all internal tree nodes.

The main goal of this work is to introduce the family of Ω-trees and to study some of their enumerative and distributional properties. Given a tree $t$ and an integer parameter $\omega \ge 0$, we say that $t$ is an $\Omega^\omega$-tree (or just an Ω-tree) if $\omega_i \le \omega$ for all internal nodes $i$. Here, $\omega_i$ is the size of the smallest left/right subtree originating at $i$, and the size of a tree is the number of its internal nodes. Ω-trees form a subset of un-restricted trees. Increasing $\omega$ relaxes the constraint; in fact, if $\omega = \omega^* = \lfloor (n-1)/2 \rfloor$, the two sets coincide, i.e. all trees of size $n$ are Ω-trees whenever $\omega \ge \omega^*$.

An interesting tree characteristic is the depth, i.e. the length of the longest root-leaf path. We show that Ω-trees, for small $\omega$, tend to be deep when compared to the average depth of an un-constrained tree.

Another remarkable property of Ω-trees concerns the number of subtrees: for each $n' > \omega$, a tree which satisfies the $\omega$-restriction has at most one subtree of size $n'$.

For small $\omega$, it is very unlikely that an Ω-tree is generated by chance under the Yule process. Despite this, Ω-trees can be representative of the entire un-constrained tree space. For instance, focusing on cherries, we show that the moments of their distribution under Ω-trees converge fast to those of the unconstrained distribution.

Properties of this kind allow one to characterize topological features of trees by considering only appropriately constrained subclasses. They are particularly interesting when studying the structure of so-called induced subtrees, which appear naturally in subsampling problems, and which are generated by extracting only those branches of an existing tree which connect the root to a subset of the leaves. We suggest that our approach, which is based on the toolbox of generating functions, can be extended to study higher level subtree-statistics and represents a first step towards a theory of induced subtrees.

2. Preliminaries 

We start with some basic definitions. A binary rooted tree is a tree with a root and in which all nodes have outdegree either 0 or 2. Nodes with outdegree 2 are called internal, nodes with outdegree 0 are external. External nodes are also called leaves. We consider the size $n$ of a tree to be the number of its internal nodes. The subtree of an internal node $i$ is the tree with root $i$. A tree is said to be un-ordered (in the graph theoretical sense) if the subtrees stemming from an internal node have no left-right order. Disregarding branch lengths, we consider the following class. A binary un-ordered tree of size $n$ is said to be a ranked tree if the set of internal nodes is totally ordered by labels $\{1, 2, \dots, n\}$ in such a way that each child-node label is greater than the parent-node label (see Fig. 1). The total order of internal labels can be interpreted as a historical time order. To emphasize this, Harding [4] called such trees histories.

The set of ranked trees of size $n$ is denoted by $\mathcal{R}_n$, and $\mathcal{R} = \bigcup_n \mathcal{R}_n$. Furthermore, given a tree $t$, we denote by $l(t)$ the number of internal nodes whose children are two leaves. Such internal nodes are called cherries of the tree. McKenzie and Steel [6] have shown that the random variable $L$, i.e. the number of cherries, is asymptotically normal for large $n$ with expectation $(n+1)/3$ and variance $2(n+1)/45$. Fig. 2 shows, for several values of $n$, the distribution of $L$ for ranked trees of size $n$.

The ω-constraint and some consequences. Let us now introduce Ω-trees as a subclass of $\mathcal{R}$. Fix $\omega \in \{0, 1, 2, \dots, n, \dots\}$ and, given a tree $t \in \mathcal{R}_n$, we say that $t$ is an Ω-tree if each node $i$ of $t$ satisfies

$$\min\left(|t_L(i)|, |t_R(i)|\right) \le \omega,$$

where $t_L(i)$ (resp. $t_R(i)$) is the left (resp. right) subtree of $i$. For fixed $\omega$, we denote by $\Omega^\omega_n$ the set of Ω-trees of size $n$. Observe that $\Omega^n_n = \mathcal{R}_n$ for every $n$.
Figure 1: The sixteen possible ranked trees of size five classified by shape. Within each class all possible orderings of the internal nodes are displayed. The number of cherries is indicated.

Figure 2: Distribution of the random variable $L$ for ranked trees of fixed size $n \le 100$ according to the Yule model. In grey the expected value $E_{L,\mathcal{R}}(n) = (n+1)/3$.

Figure 3: The shape of a tree in $\Omega^2_9$. The ordering of the internal nodes is omitted.

Figure 4: The average depth (solid line) of $10^6$ ranked trees of size $n$ versus the lower bound $(n-\omega)/(\omega+1) \approx n/(\omega+1)$ with $\omega = 3$ (dashed line). The dotted line represents the lower bound for un-constrained trees, $\log_2(n+1)$.

If $\omega$ is small, the constraint has a strong effect on the topology of the resulting trees. An Ω-tree looks as in Fig. 3. It has an extended back-bone to which "small" trees of size at most $\omega$ are appended. The length of this path, i.e., the number of nodes it contains, is bounded (from below) by $(n-\omega)/(\omega+1)$. For $\omega$ small, it provides a measure of the depth of the tree, where the latter is the number of arcs in the longest path which connects the root to a leaf. In an un-constrained ranked tree the minimum depth is $\log_2(n+1)$. Average depth is depicted in Fig. 4 and was obtained by simulations of $10^6$ ranked trees [5] each for $n = 10, 20, 30, 40, 50$. Note that, for $n$ sufficiently large, the average depth of un-constrained trees is smaller than the lower bound for Ω-trees.
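The depth comparison can be explored with a small simulation. The following Python sketch is not the simulation code used for Fig. 4; it is a minimal illustration that grows a Yule tree by splitting a uniformly chosen pendant edge and records only leaf depths. The function names, the seed and the sample size are assumptions of this example.

```python
import random


def yule_depth(n, rng):
    """Depth (edges on the longest root-leaf path) of a Yule tree with n internal nodes."""
    leaf_depths = [0]                          # the root starts as the only pendant edge
    for _ in range(n):
        i = rng.randrange(len(leaf_depths))    # every leaf is equally likely to split
        d = leaf_depths.pop(i)
        leaf_depths.extend([d + 1, d + 1])     # the split leaf becomes two deeper leaves
    return max(leaf_depths)


def average_depth(n, samples, seed=0):
    """Monte Carlo estimate of the average depth of un-constrained trees of size n."""
    rng = random.Random(seed)
    return sum(yule_depth(n, rng) for _ in range(samples)) / samples


def backbone_lower_bound(n, omega):
    """Lower bound (n - omega) / (omega + 1) on the back-bone length of an Omega-tree."""
    return (n - omega) / (omega + 1)
```

Comparing `average_depth(n, samples)` with `backbone_lower_bound(n, 3)` for increasing `n` gives a rough picture of the behaviour summarized in Fig. 4.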

The effect of the $\omega$-constraint becomes manifest also in the number of different subtrees. Indeed, for each $n' > \omega$, an Ω-tree contains at most one subtree of size $n'$. The tree shown in Fig. 3 has size 9 and belongs to $\Omega^2_9$. It does not contain any subtree of size 4 and just one of size 3.

3. The number of Ω-trees

In this section we count the number of possible Ω-trees of size $n$. In other words, we determine the cardinality of $\Omega^\omega_n$. Furthermore, recalling that under the Yule model the probability of a ranked tree $t$ of size $n$ with $l$ cherries is given by Tajima's weight [11], [2]

$$\frac{2^{n-l}}{n!},$$

we also need to consider the number of cherries in our enumerations.

Let $(e_n)_{n \ge 0}$ be the sequence of Euler numbers. They enumerate un-constrained trees [2], i.e., $e_n = |\mathcal{R}_n|$. The first terms of the sequence are

1, 1, 1, 2, 5, 16, 61, 272, 1385, 7936, 50521, ...

which means, for example, that there are exactly 50521 different ranked trees of size 10.



Let us fix $\omega$ and note that if $t \in \Omega^\omega_n$ with $n > 2\omega + 1$, then $t$ is built by appending to a common root an Ω-tree $t_1$ with $|t_1| > \omega$ and a ranked tree $t_2$ with $k = |t_2| \le \omega$. Finally we need to merge the order of the nodes of $t_1$ with the one for the nodes of $t_2$. This can be done in exactly $\binom{n-1}{k}$ ways since there are no symmetries between $t_1$ and $t_2$. Thus, considering that for the first $2\omega + 1$ values we have

$$|\Omega^\omega_1| = e_1,\ |\Omega^\omega_2| = e_2,\ \dots,\ |\Omega^\omega_{2\omega+1}| = e_{2\omega+1},$$

we can define, for $n > 2\omega + 1$, the following recursion

$$|\Omega^\omega_n| = \sum_{k=0}^{\omega} \binom{n-1}{k}\, |\Omega^\omega_{n-1-k}|\, e_k .$$
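The recursion is straightforward to evaluate. The Python sketch below is one possible implementation, not the authors' code; the function names are hypothetical, and the Euler numbers are generated with the standard identity $2e_{n+1} = \sum_k \binom{n}{k} e_k e_{n-k}$ for $n \ge 1$, which is an assumption brought in from outside this text.

```python
from math import comb


def euler_numbers(n_max):
    """Euler (zigzag) numbers e_0..e_n_max via 2*e_{n+1} = sum_k C(n,k) e_k e_{n-k}."""
    e = [1, 1]
    for n in range(1, n_max):
        total = sum(comb(n, k) * e[k] * e[n - k] for k in range(n + 1))
        e.append(total // 2)
    return e[: n_max + 1]


def count_omega_trees(n_max, omega):
    """List of |Omega_n| for n = 0..n_max and a fixed omega."""
    e = euler_numbers(max(n_max, omega))
    # For n <= 2*omega + 1 every ranked tree is an Omega-tree.
    counts = e[: min(n_max, 2 * omega + 1) + 1]
    for n in range(2 * omega + 2, n_max + 1):
        counts.append(
            sum(comb(n - 1, k) * counts[n - 1 - k] * e[k] for k in range(omega + 1))
        )
    return counts
```

For ω = 0 the function returns 1 for every size, which matches the fact that the only admissible shape is then the caterpillar with its unique ranking.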



In order to consider also the number of cherries, we need to refine the previous formula. Let $e_{n,l}$ be the number of trees in $\mathcal{R}_n$ having exactly $l$ cherries. Similarly, $\Omega^\omega_{n,l}$ is the class of Ω-trees of size $n$ with $l$ cherries. The recursion above becomes then $|\Omega^\omega_{n,l}| = e_{n,l}$ if $n \le 2\omega + 1$ while, when $n > 2\omega + 1$, we have to consider

$$|\Omega^\omega_{n,l}| = \sum_{k=0}^{\omega} \sum_{j=0}^{\lceil k/2 \rceil} \binom{n-1}{k}\, |\Omega^\omega_{n-1-k,\,l-j}|\, e_{k,j}. \qquad (1)$$
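Recursion (1) can be evaluated directly once the numbers $e_{n,l}$ are known. In the following Python sketch the $e_{n,l}$ are obtained from the analogous root decomposition of un-constrained ranked trees (each unordered pair of subtrees is counted twice, hence the division by 2); this auxiliary recursion and all function names are assumptions of the example rather than part of the text.

```python
from math import comb


def _accumulate(acc, a, b, weight):
    """acc[i + j] += weight * a[i] * b[j] for cherry distributions stored as dicts."""
    for i, u in a.items():
        for j, v in b.items():
            acc[i + j] = acc.get(i + j, 0) + weight * u * v


def unconstrained_cherry_counts(n_max):
    """e[n][l] = number of ranked trees of size n with l cherries."""
    e = [{0: 1}]                      # the single leaf: size 0, no cherry
    if n_max >= 1:
        e.append({1: 1})              # the cherry itself
    for n in range(2, n_max + 1):
        acc = {}
        for k in range(n):
            _accumulate(acc, e[k], e[n - 1 - k], comb(n - 1, k))
        e.append({l: v // 2 for l, v in acc.items()})
    return e


def omega_cherry_counts(n_max, omega):
    """om[n][l] = |Omega_{n,l}| computed with recursion (1)."""
    e = unconstrained_cherry_counts(max(n_max, omega))
    om = [dict(d) for d in e[: min(n_max, 2 * omega + 1) + 1]]
    for n in range(2 * omega + 2, n_max + 1):
        acc = {}
        for k in range(omega + 1):
            _accumulate(acc, e[k], om[n - 1 - k], comb(n - 1, k))
        om.append(acc)
    return om
```

Calling `omega_cherry_counts(10, 2)` gives the columns of the table for ω = 2, and `unconstrained_cherry_counts(10)` those of the table of $e_{n,l}$.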



Note that we can compute the numbers $e_{n,l}$ through a standard Taylor expansion centered at $z = 0$ of the following exponential generating function

$$F(z,x) = \sum_{t \in \mathcal{R}} \frac{z^{|t|}\, x^{l(t)}}{|t|!}.$$

Indeed we have [2]

$$F(z,x) = 1 + \frac{2\left(x \exp\left(z\sqrt{-2x+1}\right) - x\right)}{\left(\sqrt{-2x+1} - 1\right)\exp\left(z\sqrt{-2x+1}\right) + \sqrt{-2x+1} + 1},$$

$$e_{n,l} = n! \times [z^n x^l]\, F(z,x),$$

and the first values are listed in the following table.

e_{n,l}   n=1   n=2   n=3   n=4   n=5   n=6   n=7   n=8    n=9    n=10
l=1         1     1     1     1     1     1     1     1      1       1
l=2         0     0     1     4    11    26    57   120    247     502
l=3         0     0     0     0     4    34   180   768   2904   10194
l=4         0     0     0     0     0     0    34   496   4288   28768
l=5         0     0     0     0     0     0     0     0    496   11056



The recursion defined in (1) can be improved by the use of generating function techniques. This provides a much better understanding of the enumerative properties of the trees we are considering.

Firstly, we characterize the generating function associated with the numbers $|\Omega^\omega_{n,l}|$. In fact, it is possible to translate the natural "root-subtrees" decomposition of Ω-trees into a functional equation which completely determines the exponential generating function

$$Y_\omega = Y_\omega(z,x) = \sum_{|t| \ge \omega+1} \frac{z^{|t|}\, x^{l(t)}}{|t|!},$$

where the sum runs over the Ω-trees $t$ of size at least $\omega + 1$.



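In the easiest case $\omega = 1$, the recursive decomposition gives for $Y_1 = \sum_{|t| \ge 2} \frac{z^{|t|} x^{l(t)}}{|t|!}$ the following equation, where $n = |t|$ and $l = l(t)$:

$$Y_1 = \frac{xz^2}{2} + \frac{x^2z^3}{6} + \sum_{|t| \ge 2} \frac{x^{l} z^{n+1}}{(n+1)!} + \sum_{|t| \ge 2} \frac{x^{l+1} z^{n+2}}{(n+2)!} \times (n+1),$$

which becomes, considering the derivative with respect to $z$,

$$\frac{\partial Y_1}{\partial z} = \underbrace{xz + \frac{x^2z^2}{2}}_{P_1(z,x)} + Y_1 \cdot (1 + xz).$$

Similarly $Y_2 = \sum_{|t| \ge 3} \frac{z^{|t|} x^{l(t)}}{|t|!}$ is defined by

$$Y_2 = \frac{x^2z^3}{6} + \frac{xz^3}{6} + \frac{x^2z^4}{8} + \frac{x^2z^5}{40} + \sum_{|t| \ge 3} \frac{x^{l} z^{n+1}}{(n+1)!} + \sum_{|t| \ge 3} \frac{x^{l+1} z^{n+2}}{(n+2)!} \times (n+1) + \sum_{|t| \ge 3} \frac{x^{l+1} z^{n+3}}{(n+3)!} \times \frac{(n+2)!}{2\, n!},$$

which gives

$$\frac{\partial Y_2}{\partial z} = \underbrace{\frac{x^2z^2}{2} + \frac{xz^2}{2} + \frac{x^2z^3}{2} + \frac{x^2z^4}{8}}_{P_2(z,x)} + Y_2 \cdot \left(1 + xz + \frac{xz^2}{2}\right).$$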

The polynomials $P_1$, $P_2$ in the above differential equations correspond (after integration) to those Ω-trees which we considered as the starting step of the recursive construction for $Y_\omega$. We have to pay attention to those trees we use at the initial stage of the procedure. Indeed observe that, to avoid redundancies in the construction, the two subtrees we append to the root of a newly generated tree must be different as ranked trees (otherwise we could wrongly create the same tree twice). It follows that each ranked tree $t$ such that $|t| \le \omega$ must not be counted in the starting step of the procedure, and that is why our function $Y_\omega$ counts only trees with $|t| \ge \omega + 1$. Once we avoid a certain tree for this reason, we must afterwards artificially insert into the mentioned polynomials those trees of size greater than $\omega$ which would otherwise not be created. This process gives rise to the monomials of $P_1$ and $P_2$ in the above equations.

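Going a step further, we can say that, for a generic $\omega$, the corresponding $Y_\omega$ must satisfy an equation of the form

$$\frac{\partial Y_\omega}{\partial z} = P_\omega + Y_\omega \cdot V_\omega,$$

where

$$V_\omega = \sum_{|t| \le \omega} \frac{z^{|t|}\, x^{l(t)}}{|t|!}$$

and $P_\omega$ is also a polynomial. In particular,

$$P_3 = \frac{xz^3}{6} + \frac{2x^2z^3}{3} + \frac{7x^2z^4}{24} + \frac{x^3z^4}{6} + \frac{x^2z^5}{12} + \frac{x^3z^5}{12} + \frac{x^2z^6}{72} + \frac{x^3z^6}{36} + \frac{x^4z^6}{72},$$

$$P_4 = \frac{xz^4}{24} + \frac{11x^2z^4}{24} + \frac{x^3z^4}{6} + \frac{x^2z^5}{8} + \frac{x^3z^5}{4} + \frac{5x^2z^6}{144} + \frac{x^3z^6}{9} + \frac{x^4z^6}{72} + \frac{x^2z^7}{144} + \frac{5x^3z^7}{144} + \frac{x^4z^7}{36} + \frac{x^2z^8}{1152} + \frac{x^3z^8}{144} + \frac{x^4z^8}{72}$$

and, more generally, one has

$$P_\omega = \frac{1}{2} V_\omega^2 - \frac{\partial V_\omega}{\partial z} + x - \frac{1}{2},$$

where $\frac{\partial V_\omega}{\partial z}$ is the derivative of the monomials associated with ranked trees of size at most $\omega$ and the remaining summands give the derivative of those of size at most $2\omega + 1$.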
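Summarizing we have

Theorem 1. For a fixed $\omega$, the exponential generating function

$$Y_\omega = Y_\omega(z,x) = \sum_{|t| \ge \omega+1} \frac{z^{|t|}\, x^{l(t)}}{|t|!}$$

satisfies

$$\frac{\partial Y_\omega}{\partial z} = P_\omega + Y_\omega \cdot V_\omega \quad \text{with } Y_\omega(0,x) = 0, \qquad (2)$$

where

$$V_\omega = V_\omega(z,x) = \sum_{|t| \le \omega} \frac{z^{|t|}\, x^{l(t)}}{|t|!} \quad \text{and} \quad P_\omega = P_\omega(z,x) = \frac{1}{2} V_\omega^2 - \frac{\partial V_\omega}{\partial z} + x - \frac{1}{2}.$$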

The solution $Y_\omega$ to (2) gives, by Taylor expansion, the number of Ω-trees of given size $n$ and number of cherries $l$. Results for $\omega = 2$ and $n \le 10$ are given in the table below.

ω = 2     n=1   n=2   n=3   n=4   n=5   n=6   n=7   n=8    n=9    n=10
l=1         1     1     1     1     1     1     1     1      1       1
l=2         0     0     1     4    11    26    47    75    111     156
l=3         0     0     0     0     4    34   160   573   1677    4044
l=4         0     0     0     0     0     0    24   346   2578   13495
l=5         0     0     0     0     0     0     0     0    192    4170
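The Taylor coefficients of $Y_2$ can also be obtained by solving (2) as a power series in $z$. The Python sketch below does this with exact rational arithmetic, taking $V_2 = 1 + xz + xz^2/2$ and $P_2 = (x + x^2)z^2/2 + x^2z^3/2 + x^2z^4/8$ as input; the representation of polynomials in $x$ as dictionaries `{power: coefficient}` and all function names are assumptions of this example.

```python
from fractions import Fraction
from math import factorial

F = Fraction
# V[k] and P[k] are the coefficients of z^k, stored as polynomials in x.
V = [{0: F(1)}, {1: F(1)}, {1: F(1, 2)}]
P = [{}, {}, {1: F(1, 2), 2: F(1, 2)}, {2: F(1, 2)}, {2: F(1, 8)}]


def padd(a, b):
    """Sum of two polynomials in x."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return {k: v for k, v in out.items() if v != 0}


def pmul(a, b):
    """Product of two polynomials in x."""
    out = {}
    for i, u in a.items():
        for j, v in b.items():
            out[i + j] = out.get(i + j, 0) + u * v
    return out


def solve_y2(n_max):
    """Coefficients y[m] of z^m in Y_2, from (m+1) y[m+1] = P[m] + sum_k V[k] y[m-k]."""
    y = [{} for _ in range(n_max + 1)]        # Y_2(0, x) = 0
    for m in range(n_max):
        rhs = dict(P[m]) if m < len(P) else {}
        for k, vk in enumerate(V):
            if m - k >= 0:
                rhs = padd(rhs, pmul(vk, y[m - k]))
        y[m + 1] = {l: c / (m + 1) for l, c in rhs.items()}
    return y


def omega2_counts(n):
    """Number of Omega-trees of size n (omega = 2), indexed by number of cherries."""
    return {l: int(c * factorial(n)) for l, c in solve_y2(n)[n].items()}
```

Since $Y_2$ only collects trees of size at least 3, `omega2_counts` is meaningful for $n \ge 3$; for those sizes it reproduces the columns of the table above.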



The defining equation (2) will be used in the next sections to describe how Ω-trees are distributed in the two dimensional $(n, l)$-space.

4. Probabilistic properties of Ω-trees

In this section we present some properties of Ω-trees when considered under the probability distribution of the Yule model. First we compute the probability of an Ω-tree of given size. Then we show that if we take a random Ω-tree of sufficiently large size $n$, the expected value (resp. the variance) of the random variable $L$ is, already for very small $\omega$ (i.e. $\omega = 2, 3$), close to the expected value (resp. the variance) of $L$ for un-constrained trees.

The starting point is the fact that, in terms of generating functions, under the Yule model the probability to generate an Ω-tree of size $n$ can be expressed as

$$P(t \in \Omega^\omega_n) = [z^n]\,[Y_\omega(2z, 1/2)].$$
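For small $n$ this probability can be computed exactly by weighting the counts $|\Omega^\omega_{n,l}|$ with Tajima's weight $2^{n-l}/n!$. The Python sketch below does so and cross-checks the result against a second recursion based on the uniform distribution of the root split in a Yule tree; that second recursion is a standard fact assumed here, not something derived in this text, and all function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def _accumulate(acc, a, b, weight):
    """acc[i + j] += weight * a[i] * b[j] for cherry distributions stored as dicts."""
    for i, u in a.items():
        for j, v in b.items():
            acc[i + j] = acc.get(i + j, 0) + weight * u * v


def omega_cherry_counts(n_max, omega):
    """om[n][l] = |Omega_{n,l}| via recursion (1); e[n][l] counts un-constrained trees."""
    top = max(n_max, omega)
    e = [{0: 1}, {1: 1}]
    for n in range(2, top + 1):
        acc = {}
        for k in range(n):
            _accumulate(acc, e[k], e[n - 1 - k], comb(n - 1, k))
        e.append({l: v // 2 for l, v in acc.items()})
    om = [dict(d) for d in e[: min(n_max, 2 * omega + 1) + 1]]
    for n in range(2 * omega + 2, n_max + 1):
        acc = {}
        for k in range(omega + 1):
            _accumulate(acc, e[k], om[n - 1 - k], comb(n - 1, k))
        om.append(acc)
    return om


def prob_omega_tree(n, omega):
    """P(t in Omega_n) = sum_l |Omega_{n,l}| * 2^(n-l) / n!."""
    counts = omega_cherry_counts(n, omega)[n]
    return sum(Fraction(c * 2 ** (n - l), factorial(n)) for l, c in counts.items())


def prob_by_root_split(n_max, omega):
    """Cross-check (assumption): the left subtree size is uniform on 0..n-1."""
    p = [Fraction(1)] * (n_max + 1)
    for n in range(1, n_max + 1):
        p[n] = sum(p[k] * p[n - 1 - k] for k in range(n)
                   if min(k, n - 1 - k) <= omega) / n
    return p
```

For ω = 2 both computations give probability 1 up to size 6, then 6/7, 5/7, 4/7 and 3/7 for sizes 7 to 10.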

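Furthermore, the expected value $E_{L,\omega}(n)$ and the variance $\mathrm{Var}_{L,\omega}(n)$ are respectively given by

$$E_{L,\omega}(n) = \sum_{l \ge 0} l \cdot \frac{P(t \in \Omega^\omega_{n,l})}{P(t \in \Omega^\omega_n)} = \frac{[z^n]\left[\left(\frac{\partial Y_\omega(2z, x/2)}{\partial x}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]}$$

and

$$\mathrm{Var}_{L,\omega}(n) = E_{L^2,\omega}(n) - \left(E_{L,\omega}(n)\right)^2 = \frac{[z^n]\left[\left(\frac{\partial^2 Y_\omega(2z, x/2)}{\partial x^2}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]} + E_{L,\omega}(n) - \left(E_{L,\omega}(n)\right)^2.$$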



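4.1. The probability of an Ω-tree of given size

Look first at the probability of an Ω-tree of given size $n$. Considering that

$$\frac{\partial Y_\omega}{\partial z}\left(2z, \frac{x}{2}\right) = \frac{1}{2}\, \frac{\partial\, Y_\omega\!\left(2z, \frac{x}{2}\right)}{\partial z},$$

equation (2), upon substituting $z$ by $2z$ and $x$ by $x/2$, becomes

$$\frac{1}{2}\, \frac{\partial\, Y_\omega\!\left(2z, \frac{x}{2}\right)}{\partial z} = P_\omega\!\left(2z, \frac{x}{2}\right) + Y_\omega\!\left(2z, \frac{x}{2}\right) \cdot V_\omega\!\left(2z, \frac{x}{2}\right), \qquad (3)$$

from which we have

$$\frac{\partial\, Y_\omega\!\left(2z, \frac{1}{2}\right)}{\partial z} = 2 P_\omega\!\left(2z, \frac{1}{2}\right) + 2 Y_\omega\!\left(2z, \frac{1}{2}\right) \cdot V_\omega\!\left(2z, \frac{1}{2}\right). \qquad (4)$$

Equation (4) can be re-written as

$$\frac{d\tilde{Y}_\omega}{dz} = 2\tilde{P}_\omega + 2\tilde{Y}_\omega \cdot \tilde{V}_\omega, \qquad (5)$$

where $\tilde{Y}_\omega = \tilde{Y}_\omega(z) = Y_\omega(2z, \frac{1}{2})$, $\tilde{P}_\omega = \tilde{P}_\omega(z) = P_\omega(2z, \frac{1}{2})$ and $\tilde{V}_\omega = \tilde{V}_\omega(z) = V_\omega(2z, \frac{1}{2})$. With boundary condition $\tilde{Y}_\omega(0) = 0$, one has the family of solutions

$$\tilde{Y}_\omega = \exp\left(2 \int \tilde{V}_\omega\, dz\right) \cdot 2 \int \exp\left(-2 \int \tilde{V}_\omega(y)\, dy\right) \tilde{P}_\omega(y)\, dy, \qquad (6)$$

where, for simplicity, we write $\int f(x)\, dx$ instead of $\int_0^x f(w)\, dw$.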

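Transfer. Setting

$$\tilde{Y}^*_\omega = \exp\left(2 \int \tilde{V}_\omega\, dz\right), \qquad (7)$$

we now compute, for several values of the parameter $\omega$, a constant $c_\omega$ such that, for $n$ large enough,

$$\frac{[z^n]\,[\tilde{Y}_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]} \approx c_\omega. \qquad (8)$$

Indeed we observe that $\tilde{Y}^*_\omega$ is the solution of

$$\frac{d\tilde{Y}^*_\omega}{dz} = 2\tilde{Y}^*_\omega \cdot \tilde{V}_\omega, \quad \text{with } \tilde{Y}^*_\omega(0) = 1. \qquad (9)$$

$\tilde{P}_\omega$ is a polynomial of degree $2\omega$ and, if one takes the derivative in equations (5) and (9) $2\omega + 1$ times, we have for both $\tilde{Y}_\omega$ and $\tilde{Y}^*_\omega$ the same differential equation of order $2\omega + 2$ (with different boundary conditions). It is then sufficient to check the desired property (8) for a finite (and small) number of possible $n$'s to conclude that it must hold for all $n$ sufficiently large.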

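Take for example $\omega = 2$. In this case we have $\tilde{V}_2 = 1 + z + z^2$, $\tilde{P}_2 = \frac{3z^2}{2} + z^3 + \frac{z^4}{2}$, and the two differential equations of order 6 which are derived from (5) and (9) are

$$\tilde{Y}_2^{(6)} = 40\, \tilde{Y}_2^{(3)}(z) + 10(1 + 2z)\, \tilde{Y}_2^{(4)}(z) + 2(1 + z + z^2)\, \tilde{Y}_2^{(5)}(z), \qquad (10)$$

with conditions

$$\tilde{Y}_2(0) = 0,\ \tilde{Y}_2^{(1)}(0) = 0,\ \tilde{Y}_2^{(2)}(0) = 0,\ \tilde{Y}_2^{(3)}(0) = 3! = 6,\ \tilde{Y}_2^{(4)}(0) = 4! = 24,\ \tilde{Y}_2^{(5)}(0) = 5! = 120,$$

and

$$\tilde{Y}_2^{*(6)} = 40\, \tilde{Y}_2^{*(3)}(z) + 10(1 + 2z)\, \tilde{Y}_2^{*(4)}(z) + 2(1 + z + z^2)\, \tilde{Y}_2^{*(5)}(z), \qquad (11)$$

with conditions

$$\tilde{Y}_2^{*}(0) = 1,\ \tilde{Y}_2^{*(1)}(0) = 2,\ \tilde{Y}_2^{*(2)}(0) = 6,\ \tilde{Y}_2^{*(3)}(0) = 24,\ \tilde{Y}_2^{*(4)}(0) = 108,\ \tilde{Y}_2^{*(5)}(0) = 552.$$

Now observe that

$$\frac{\tilde{Y}_2^{(5)}(0)}{\tilde{Y}_2^{*(5)}(0)} \approx \frac{\tilde{Y}_2^{(4)}(0)}{\tilde{Y}_2^{*(4)}(0)} \approx \frac{\tilde{Y}_2^{(3)}(0)}{\tilde{Y}_2^{*(3)}(0)} \approx 0.2$$

and then, since (10) and (11) are linear, the same constant propagates for the ratios involving higher order terms. Estimating $c_2$ numerically one finds $c_2 \approx 0.22399$.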
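The numerical estimate of $c_2$ can be reproduced from the first-order equations (5) and (9) by comparing Taylor coefficients. The following Python sketch uses $\tilde{V}_2 = 1 + z + z^2$ and $\tilde{P}_2 = 3z^2/2 + z^3 + z^4/2$ and exact rational arithmetic; the function name and the choice $n = 60$ in the usage note are assumptions of this example.

```python
from fractions import Fraction


def ratio_c2(n):
    """[z^n] Y~_2 / [z^n] Y~*_2 from the coefficient recurrences of (5) and (9)."""
    omega = 2
    p_tilde = {2: Fraction(3, 2), 3: Fraction(1), 4: Fraction(1, 2)}
    y = [Fraction(0)] * (n + 1)       # coefficients of Y~_2, with Y~_2(0) = 0
    s = [Fraction(0)] * (n + 1)       # coefficients of Y~*_2, with Y~*_2(0) = 1
    s[0] = Fraction(1)
    for m in range(n):
        # V~_2 = 1 + z + z^2, so the product with V~_2 sums omega + 1 coefficients.
        conv_y = sum(y[m - k] for k in range(omega + 1) if m - k >= 0)
        conv_s = sum(s[m - k] for k in range(omega + 1) if m - k >= 0)
        y[m + 1] = (2 * p_tilde.get(m, 0) + 2 * conv_y) / (m + 1)
        s[m + 1] = 2 * conv_s / (m + 1)
    return y[n] / s[n]
```

The ratio at $n = 5$ equals $120/552$, the first of the ratios displayed above, and `float(ratio_c2(60))` can be compared with the constant $c_2$ quoted in the text.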

The same procedure can be applied to other values of $\omega$. In the following table we give $c_\omega \approx ([z^n]\,[\tilde{Y}_\omega]) / ([z^n]\,[\tilde{Y}^*_\omega])$ when $\omega = 1, 2, 3, 4, 5$.

        ω = 1   ω = 2   ω = 3   ω = 4   ω = 5
c_ω     0.311   0.224   0.175   0.143   0.122



Through $c_\omega$ we can relate the coefficients of $\tilde{Y}_\omega$ (6) with those of $\tilde{Y}^*_\omega$ (7). Moreover, $[z^n]\,[\tilde{Y}^*_\omega]$ can be extracted, for $n$ large enough, by standard methods of analytic combinatorics. Indeed, $\tilde{Y}^*_\omega$ is an exponential of a polynomial with positive coefficients and one can apply results from saddle-point methods (see Flajolet and Sedgewick [3]): suppose $p(z) = a_1 z + a_2 z^2 + \dots + a_m z^m$ is a polynomial with non-negative coefficients and aperiodic, i.e., $\gcd\{j : a_j \ne 0\} = 1$; then there exists a function $r = r(n)$, which is defined as the positive real solution of the equation

$$r \cdot \frac{dp(r)}{dr} = n,$$

such that

$$[z^n] \exp(p(z)) \sim \frac{1}{\sqrt{2\pi\lambda}}\, \frac{\exp(p(r))}{r^n},$$

where

$$\lambda = \lambda(r) = r \cdot \frac{d}{dr}\left(r \cdot \frac{dp(r)}{dr}\right).$$

In our case, depending on $\omega$, we have

$$p(r) = p_\omega(r) = 2 \int \tilde{V}_\omega(r)\, dr = 2\left(r + \frac{r^2}{2} + \dots + \frac{r^{\omega+1}}{\omega+1}\right)$$

and

$$\lambda(r) = \lambda_\omega(r) = 2r\left(1 + 2r + 3r^2 + \dots + (\omega+1) r^\omega\right).$$



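When $\omega = 1, 2$, $r = r_\omega(n)$ is

$$r_1(n) = \frac{-1 + \sqrt{1 + 2n}}{2},$$

$$r_2(n) = \frac{1}{6}\left(-2 - \frac{4 \cdot 2^{2/3}}{\left(14 + 27n + 3\sqrt{36 + 84n + 81n^2}\right)^{1/3}} + \left(28 + 54n + 6\sqrt{36 + 84n + 81n^2}\right)^{1/3}\right) \approx \left(\frac{n}{2}\right)^{1/3} - \frac{1}{3}.$$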

If $\omega \ge 4$, analytic solutions of $r \cdot \frac{dp(r)}{dr} = n$ are not available in general but, still, for any fixed $n$, we can compute numerically the value $r_\omega(n)$. In Fig. 5 we show the result for $\omega = 2, 4, 6, 8$. Furthermore, when $n$ is large, one can approximate $r_\omega(n)$ as

$$r_\omega(n) \approx \left(\frac{n}{2}\right)^{1/(\omega+1)} - \frac{1}{\omega+1}. \qquad (12)$$

Indeed, observe that

$$r \cdot \frac{dp(r)}{dr} = 2r\left(1 + r + \dots + r^\omega\right) = 2r\, \frac{r^{\omega+1} - 1}{r - 1}.$$

Then, the equation which defines $r(n) = r_\omega(n)$ can be written as

$$2r^{\omega+2} + r(-n - 2) + n = 0.$$

Now suppose $n$ large. If we divide by $n$, the equation becomes equivalent to

$$\frac{2r^{\omega+2}}{n} - r + 1 = 0.$$

Letting $r = (\alpha n)^{1/(\omega+1)} + \beta$ gives

$$\frac{2\alpha n (\alpha n)^{1/(\omega+1)} + 2\alpha\beta(\omega+2)\, n + o(n)}{n} - (\alpha n)^{1/(\omega+1)} - \beta + 1 = 0$$

and then

$$2\alpha (\alpha n)^{1/(\omega+1)} + 2\alpha\beta(\omega+2) + \frac{o(n)}{n} - (\alpha n)^{1/(\omega+1)} - \beta + 1 = 0.$$

Thus, for $n$ large, the desired equality holds when $\alpha = 1/2$ and $\beta = -1/(\omega+1)$, which give $r$ as in (12).
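The quality of approximation (12) is easy to inspect numerically. The following Python sketch solves $2r(1 + r + \dots + r^\omega) = n$ by bisection and evaluates the right-hand side of (12); the function names, the bracketing interval and the iteration count are choices made for this example.

```python
def saddle_radius(n, omega, iterations=200):
    """Positive real solution r of 2 r (1 + r + ... + r^omega) = n, by bisection."""
    def f(r):
        return 2 * r * sum(r ** k for k in range(omega + 1)) - n

    lo, hi = 0.0, max(1.0, float(n))   # f(lo) < 0 <= f(hi) for every n >= 1
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def approx_radius(n, omega):
    """Large-n approximation (12): (n/2)^(1/(omega+1)) - 1/(omega+1)."""
    return (n / 2) ** (1 / (omega + 1)) - 1 / (omega + 1)
```

For instance, $n = 12$ with $\omega = 1$ and $n = 28$ with $\omega = 2$ both have the exact solution $r = 2$, which the bisection recovers.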

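Figure 5: Plot of $r_\omega(n)$ for $\omega = 2, 4, 6, 8$.

Finally, putting everything together, we have

Theorem 2. The coefficients of

$$\tilde{Y}^*_\omega(z) = e^{2 \int \tilde{V}_\omega(z)\, dz}$$

satisfy

$$[z^n]\,[\tilde{Y}^*_\omega] \sim \frac{\exp\left(2\left(r + \frac{r^2}{2} + \dots + \frac{r^{\omega+1}}{\omega+1}\right)\right)}{2 r^n \sqrt{\pi r \left(1 + 2r + \dots + (\omega+1) r^\omega\right)}}, \qquad (13)$$

where $r = r(n)$ is the positive real solution of

$$2r\left(1 + r + \dots + r^\omega\right) = n$$

and asymptotically

$$r(n) \approx \left(\frac{n}{2}\right)^{1/(\omega+1)} - \frac{1}{\omega+1}.$$

Furthermore, the probability of an Ω-tree of size $n$ under the Yule model is

$$P(t \in \Omega^\omega_n) = [z^n]\,[\tilde{Y}_\omega] \approx c_\omega \cdot [z^n]\,[\tilde{Y}^*_\omega].$$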
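Estimate (13) can be compared with exact coefficients for moderate $n$. The Python sketch below computes $[z^n]\,[\tilde{Y}^*_\omega]$ exactly from the recurrence implied by (9) and evaluates the right-hand side of (13); it is an illustration under the stated formulas, and the function names and the bisection solver are assumptions of the example.

```python
from fractions import Fraction
from math import exp, pi, sqrt


def exact_coefficient(n, omega):
    """[z^n] exp(2 (z + z^2/2 + ... + z^(omega+1)/(omega+1))), as an exact fraction."""
    s = [Fraction(1)]
    for m in range(n):
        # (m + 1) s[m+1] = 2 (s[m] + s[m-1] + ... + s[m-omega])
        window = sum(s[max(0, m - omega): m + 1])
        s.append(2 * window / (m + 1))
    return s[n]


def saddle_radius(n, omega, iterations=200):
    """Positive real solution of 2 r (1 + r + ... + r^omega) = n, by bisection."""
    lo, hi = 0.0, max(1.0, float(n))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if 2 * mid * sum(mid ** k for k in range(omega + 1)) < n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def saddle_estimate(n, omega):
    """Right-hand side of (13)."""
    r = saddle_radius(n, omega)
    p = 2 * sum(r ** (k + 1) / (k + 1) for k in range(omega + 1))
    weighted = sum((k + 1) * r ** k for k in range(omega + 1))
    return exp(p) / (2 * r ** n * sqrt(pi * r * weighted))
```

The ratio `saddle_estimate(n, omega) / float(exact_coefficient(n, omega))` should approach 1 as `n` grows; for ω = 0 the estimate reduces to Stirling's approximation of $2^n / n!$.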

As $n$ grows, the probability $P(t \in \Omega^\omega_n)$ goes to 0 very fast. For example when $\omega = 3$, if we set $n = 30$, the corresponding value is of order $10^{-4}$ while, for $n = 100$, the order is $10^{-25}$. This clearly shows that the Yule process generates just a small number of Ω-trees.

In the next sections we will focus on the expected value and the variance of the random variable $L$. Given the previous theorem and equation (13), we will express our results in terms of coefficients of $\tilde{Y}^*_\omega$.

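4.2. The expected number of cherries in a random Ω-tree of given size

Let us now go back to (3) to compute $[z^n]\left(\frac{\partial Y_\omega(2z, x/2)}{\partial x}\right)_{x=1}$. The mentioned equation can be re-written as

$$\frac{\partial \hat{Y}_\omega}{\partial z} = 2\hat{P}_\omega + 2\hat{Y}_\omega \cdot \hat{V}_\omega,$$

where $\hat{Y}_\omega = \hat{Y}_\omega(z,x) = Y_\omega(2z, \frac{x}{2})$, $\hat{P}_\omega = \hat{P}_\omega(z,x) = P_\omega(2z, \frac{x}{2})$ and $\hat{V}_\omega = \hat{V}_\omega(z,x) = V_\omega(2z, \frac{x}{2})$. As in (6), with boundary condition given by $\hat{Y}_\omega(0,x) = 0$, one has solutions

$$\hat{Y}_\omega = \exp\left(2 \int \hat{V}_\omega\, dz\right) \cdot 2 \int \exp\left(-2 \int \hat{V}_\omega(y,x)\, dy\right) \hat{P}_\omega(y,x)\, dy.$$

The expression for $\frac{\partial \hat{Y}_\omega}{\partial x}$ is then

$$\frac{\partial \hat{Y}_\omega}{\partial x} = 2\, \frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\, \hat{Y}_\omega(z,x) + H_\omega(z,x), \qquad (14)$$

where

$$H_\omega(z,x) = \exp\left(2 \int \hat{V}_\omega(z,x)\, dz\right) \times 2 \int \exp\left(-2 \int \hat{V}_\omega(y,x)\, dy\right) \underbrace{\left(-2\, \frac{\partial\left(\int \hat{V}_\omega(y,x)\, dy\right)}{\partial x}\, \hat{P}_\omega(y,x) + \frac{\partial \hat{P}_\omega(y,x)}{\partial x}\right)}_{Q_\omega(y,x)} dy \qquad (15)$$

and $Q_\omega(z,x)$ is a polynomial of order $(\omega + 1) + 2\omega = 3\omega + 1$ in $z$.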
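In particular, we also have

$$\left(\frac{\partial \hat{Y}_\omega}{\partial x}\right)_{x=1} = 2 \left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1} \hat{Y}_\omega(z,1) + H_\omega(z,1), \qquad (16)$$

where $\hat{Y}_\omega(z,1) = \tilde{Y}_\omega(z)$.

Observe that $H_\omega(z,1)$ satisfies

$$\frac{\partial H_\omega(z,1)}{\partial z} = 2 Q_\omega(z,1) + 2 H_\omega(z,1) \cdot \hat{V}_\omega(z,1)$$

and, given that $\hat{V}_\omega(z,1) = \tilde{V}_\omega(z)$, we can apply to $H_\omega(z,1)$ the same trick used before to relate its coefficients to those of $\tilde{Y}^*_\omega$. Indeed, $H_\omega(z,1)$ and $\tilde{Y}^*_\omega$ satisfy the same linear equation of order $3\omega + 3$. As before, for $n$ large enough, the ratio $([z^n]\,[H_\omega(z,1)]) / ([z^n]\,[\tilde{Y}^*_\omega])$ converges to a constant, $h_\omega$; see the following table.

        ω = 1   ω = 2   ω = 3   ω = 4   ω = 5
h_ω     0.224   0.155   0.119   0.097   0.082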



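We are almost done. If we go back to (16), we have not yet considered the polynomial $2\left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1}$ which multiplies $\hat{Y}_\omega(z,1)$. By the definition of $V_\omega$ and $\hat{V}_\omega$ we have that

$$\left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1} = \sum_{i=0}^{\omega} \left(\sum_{j=0}^{\lceil i/2 \rceil} j \cdot e_{i,j}\, 2^{i-j}\right) \frac{z^{i+1}}{(i+1)!} = \frac{z^2}{2} + \frac{1}{3} \sum_{i=2}^{\omega} z^{i+1}, \qquad (17)$$

where the last equality uses that the expected number of cherries in an un-constrained ranked tree of size $i$ is 1 for $i = 1$ and $(i+1)/3$ for $i \ge 2$. From (17) we can compute, for $n$ large enough, the coefficients

$$[z^n] \left(\frac{\partial Y_\omega(2z, x/2)}{\partial x}\right)_{x=1} \approx c_\omega \cdot [z^n] \left[2 \left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1} \tilde{Y}^*_\omega\right] + h_\omega \cdot [z^n]\,[\tilde{Y}^*_\omega].$$

If we now divide by $[z^n]\,[Y_\omega(2z, 1/2)]$ we have the desired expected value.

Figure 6: Left: plot of $E_{L,\omega}(n)$ for $\omega = 1, 2$ and of $E_{L,\mathcal{R}}(n) = (n+1)/3$ (line labelled $\omega = \infty$). Right: plot of $E_{L,3}(n) - E_{L,\mathcal{R}}(n)$.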



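Theorem 3. The expected value of the number of cherries in a random Ω-tree of size $n$ generated under the Yule model is

$$E_{L,\omega}(n) = \frac{[z^n]\left[\left(\frac{\partial Y_\omega(2z, x/2)}{\partial x}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]} \approx \frac{[z^{n-2}]\,[\tilde{Y}^*_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]} + \frac{2}{3} \sum_{i=3}^{\omega+1} \frac{[z^{n-i}]\,[\tilde{Y}^*_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]} + \frac{h_\omega}{c_\omega}. \qquad (18)$$

Graphs of eq. (18) are drawn in Fig. 6 for $\omega = 1, 2, 3$.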



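4.3. The variance of the number of cherries for a random Ω-tree of given size

Given that

$$E_{L^2,\omega}(n) - E_{L,\omega}(n) = \frac{[z^n]\left[\left(\frac{\partial^2 Y_\omega(2z, x/2)}{\partial x^2}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]},$$

the variance of $L$ can be computed as

$$\mathrm{Var}_{L,\omega}(n) = E_{L^2,\omega}(n) - \left(E_{L,\omega}(n)\right)^2 = \frac{[z^n]\left[\left(\frac{\partial^2 Y_\omega(2z, x/2)}{\partial x^2}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]} + E_{L,\omega}(n) - \left(E_{L,\omega}(n)\right)^2.$$

Then, all we need is to derive from (14) the value of $[z^n]\left[\left(\frac{\partial^2 \hat{Y}_\omega}{\partial x^2}\right)_{x=1}\right]$.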
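Using the fact that $\hat{Y}_\omega$ satisfies (14) and that $H_\omega(z,x)$ is as in (15), we have

$$\frac{\partial^2 \hat{Y}_\omega}{\partial x^2} = 2\, \frac{\partial^2\left(\int \hat{V}_\omega\, dz\right)}{\partial x^2}\, \hat{Y}_\omega + 2\, \frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\, \frac{\partial \hat{Y}_\omega}{\partial x} + \frac{\partial H_\omega}{\partial x}$$

$$= 2\, \frac{\partial^2\left(\int \hat{V}_\omega\, dz\right)}{\partial x^2}\, \hat{Y}_\omega + 2\, \frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x} \left(2\, \frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\, \hat{Y}_\omega + H_\omega(z,x)\right) + 2\, \frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\, H_\omega(z,x) + M_\omega(z,x),$$

where

$$M_\omega(z,x) = \exp\left(2 \int \hat{V}_\omega(z,x)\, dz\right) \times 2 \int \exp\left(-2 \int \hat{V}_\omega(y,x)\, dy\right) \left(-2\, \frac{\partial\left(\int \hat{V}_\omega(y,x)\, dy\right)}{\partial x}\, Q_\omega(y,x) + \frac{\partial Q_\omega(y,x)}{\partial x}\right) dy$$

and

$$\frac{[z^n]\,[M_\omega(z,1)]}{[z^n]\,[\tilde{Y}^*_\omega]} \approx k_\omega, \quad \text{with } k_1 \approx -0.093,\ k_2 \approx -0.057,\ k_3 \approx -0.046,\ k_4 \approx -0.038,\ k_5 \approx -0.032.$$

Thus

$$[z^n] \left[\left(\frac{\partial^2 \hat{Y}_\omega}{\partial x^2}\right)_{x=1}\right] \approx c_\omega \cdot [z^n] \left[B_\omega(z)\, \tilde{Y}^*_\omega\right] + 4 h_\omega \cdot [z^n] \left[\left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1} \tilde{Y}^*_\omega\right] + k_\omega \cdot [z^n]\,[\tilde{Y}^*_\omega],$$

where

$$B_\omega(z) = 2 \left(\frac{\partial^2\left(\int \hat{V}_\omega\, dz\right)}{\partial x^2}\right)_{x=1} + 4 \left(\left(\frac{\partial\left(\int \hat{V}_\omega\, dz\right)}{\partial x}\right)_{x=1}\right)^2$$

is a polynomial of order $2\omega + 2$ with coefficients $b_{\omega,i} = [z^i]\,[B_\omega(z)]$.

Therefore we have the variance of $L$ as follows.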

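Theorem 4. The variance of the number of cherries in a random Ω-tree of size $n$ generated under the Yule model is

$$\mathrm{Var}_{L,\omega}(n) = \frac{[z^n]\left[\left(\frac{\partial^2 Y_\omega(2z, x/2)}{\partial x^2}\right)_{x=1}\right]}{[z^n]\,[Y_\omega(2z, 1/2)]} + E_{L,\omega}(n) - \left(E_{L,\omega}(n)\right)^2$$

$$\approx \sum_{i=0}^{2\omega+2} b_{\omega,i}\, \frac{[z^{n-i}]\,[\tilde{Y}^*_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]} + \frac{4 h_\omega}{c_\omega} \left(\frac{1}{2}\, \frac{[z^{n-2}]\,[\tilde{Y}^*_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]} + \frac{1}{3} \sum_{i=3}^{\omega+1} \frac{[z^{n-i}]\,[\tilde{Y}^*_\omega]}{[z^n]\,[\tilde{Y}^*_\omega]}\right) + \frac{k_\omega}{c_\omega} + E_{L,\omega}(n) - \left(E_{L,\omega}(n)\right)^2,$$

where $E_{L,\omega}(n)$ is as in (18) and the coefficients $b_{\omega,i}$ are given, for $\omega = 1, 2, 3, 4, 5$, in the following table (empty entries are zero).

           z^4    z^5     z^6     z^7    z^8    z^9    z^10   z^11   z^12
B_1(z)     1
B_2(z)     1      4/3     4/9
B_3(z)     4/3    4/3     16/9    8/9    4/9
B_4(z)     4/3    28/15   16/9    20/9   4/3    8/9    4/9
B_5(z)     4/3    28/15   38/15   20/9   8/3    16/9   4/3    8/9    4/9

Figure 7: Left: plot of $\mathrm{Var}_{L,\omega}(n)$ for $\omega = 2, 3$ and of $\mathrm{Var}_{L,\mathcal{R}}(n) = 2(n+1)/45$ (line labelled $\omega = \infty$). Right: plot of $\mathrm{Var}_{L,3}(n) - \mathrm{Var}_{L,\mathcal{R}}(n)$.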



In Fig. 7 we plot $\mathrm{Var}_{L,\omega}(n)$ for $\omega = 2, 3$; we also show the difference $\mathrm{Var}_{L,3}(n) - \mathrm{Var}_{L,\mathcal{R}}(n)$.

To conclude our analysis we compare the entire distribution of the random variable $L$ for ranked trees and Ω-trees (see Fig. 8): they essentially coincide for $n$ moderately large. Recall that in the un-constrained case the distribution is asymptotically Gaussian (see McKenzie and Steel [6]).
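For small sizes the mean and variance of $L$ can also be computed exactly, without the asymptotic estimates, by weighting the counts $|\Omega^\omega_{n,l}|$ with Tajima's weight. The Python sketch below does this; it is an illustrative check with hypothetical function names, and the auxiliary recursion for the un-constrained counts $e_{n,l}$ is an assumption of the example.

```python
from fractions import Fraction
from math import comb


def _accumulate(acc, a, b, weight):
    """acc[i + j] += weight * a[i] * b[j] for cherry distributions stored as dicts."""
    for i, u in a.items():
        for j, v in b.items():
            acc[i + j] = acc.get(i + j, 0) + weight * u * v


def omega_cherry_counts(n_max, omega):
    """om[n][l] = |Omega_{n,l}| via recursion (1); e[n][l] counts un-constrained trees."""
    top = max(n_max, omega)
    e = [{0: 1}, {1: 1}]
    for n in range(2, top + 1):
        acc = {}
        for k in range(n):
            _accumulate(acc, e[k], e[n - 1 - k], comb(n - 1, k))
        e.append({l: v // 2 for l, v in acc.items()})
    om = [dict(d) for d in e[: min(n_max, 2 * omega + 1) + 1]]
    for n in range(2 * omega + 2, n_max + 1):
        acc = {}
        for k in range(omega + 1):
            _accumulate(acc, e[k], om[n - 1 - k], comb(n - 1, k))
        om.append(acc)
    return om


def cherry_moments(n, omega):
    """Exact (mean, variance) of L for a Yule tree conditioned to lie in Omega_n."""
    counts = omega_cherry_counts(n, omega)[n]
    # Tajima weight 2^(n-l) / n!; the common factor 1/n! cancels in the ratios.
    weights = {l: c * 2 ** (n - l) for l, c in counts.items()}
    total = sum(weights.values())
    mean = Fraction(sum(l * w for l, w in weights.items()), total)
    second = Fraction(sum(l * l * w for l, w in weights.items()), total)
    return mean, second - mean ** 2
```

Choosing `omega >= n` removes the constraint, so `cherry_moments(10, 10)` returns the un-constrained values $(n+1)/3 = 11/3$ and $2(n+1)/45 = 22/45$; `cherry_moments(10, 2)` gives a mean of $1199/324 \approx 3.70$, already close to $11/3$.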

5. Conclusions and further directions 

In this work we investigated the possibility of reducing the size of the tree-space of Yule trees by introducing a strongly restrictive topological condition while still maintaining representative statistical features of Yule trees. The proposed $\omega$-restriction was given in terms of imbalance constraints which reduce the variety of possible subtree shapes permitted in a constrained tree. For the statistic "number of cherries", we have shown that the resulting subclasses are representative of the entire class. It is remarkable that this is true even if the imposed constraint is very strong.

For sufficiently large $\omega$, all ranked trees of size $n$ are Ω-trees. It is then natural to ask, for any given statistic $\sigma$, what is the minimum value of $\omega = \omega_\sigma$ which makes the associated trees representative of the un-constrained class. We have here studied in detail the case $\sigma = L = L_1$. In principle, analogous results can be obtained if $\sigma = L_k$ ($k > 1$), i.e. when the statistic in question is the number of subtrees of size $k$. We have shown (see Fig. 9) that, for instance, the random variable $L_2$, i.e., the number of pitchforks (see Rosenberg [8]) in a Yule-generated ranked tree, has an expectation which is very close to that of un-constrained trees already for $\omega = 3$ and if $n$ is moderately large ($n \le 50$).

Figure 8: Distribution of $L$ for ranked trees and Ω-trees of size $n = 50$.

Figure 9: Expected value of $L_2$ for Ω-trees versus ranked trees.

In order to better explore the representative power of Ω-trees, one would require an efficient algorithm to generate them in a way which respects the probability distribution of the Yule process. A rejection method based on a previous random generation of un-constrained trees is not efficient when $\omega$ is small with respect to the size, because the numbers are prohibitive: for example, for $\omega = 3$ the probability of an Ω-tree of size (just) $n = 100$ is of the order $10^{-25}$ (see Section 4.1).
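The acceptance rate of such a rejection method can be estimated with a few lines of code. The Python sketch below is only an illustration: it samples the subtree sizes of a Yule tree through uniformly distributed root splits (a standard property of the Yule model that is assumed here, not derived in this text) and reports the fraction of samples that satisfy the ω-constraint. The function names, seed and sample sizes are hypothetical.

```python
import random


def sample_is_omega_tree(n, omega, rng):
    """Sample the split sizes of a Yule tree of size n; True if it is an Omega-tree."""
    if n <= 1:
        return True
    k = rng.randrange(n)                 # left subtree size, uniform on 0..n-1
    if min(k, n - 1 - k) > omega:
        return False                     # constraint violated at this node: reject
    return (sample_is_omega_tree(k, omega, rng)
            and sample_is_omega_tree(n - 1 - k, omega, rng))


def acceptance_rate(n, omega, samples, seed=0):
    """Fraction of sampled Yule trees of size n that are Omega-trees."""
    rng = random.Random(seed)
    hits = sum(sample_is_omega_tree(n, omega, rng) for _ in range(samples))
    return hits / samples
```

For $\omega = 2$ and $n = 7$ the exact acceptance probability is $6/7$, so the estimate should be close to 0.857; for larger $n$ the rate drops quickly, which is the inefficiency described above.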

Finally, we remark that well-defined constraints which maintain statistical properties should be of interest 
in the design of efficient algorithms to search tree-space and we suggest that this field of research deserves 
further investigation. 